Goto

Collaborating Authors

 edge case



Using MLIR Transform to Design Sliced Convolution Algorithm

Ferrari, Victor, Pereira, Marcio, Alvarenga, Lucas, Leite, Gustavo, Araujo, Guido

arXiv.org Artificial Intelligence

This paper proposes SConvTransform, a Transform dialect extension that provides operations for optimizing 2D convolutions in MLIR. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline. The process is guided by a Convolution Slicing Analysis that determines tile sizes and data layout strategies based on input and filter shapes, as well as target architecture parameters. SConvOp handles edge cases by splitting irregular regions and adjusting affine maps where needed. All packing and tiling operations are derived from a parametric set of affine equations, enabling reusable and analyzable transformations. Although functional correctness was the primary goal of this work, the experimental evaluation demonstrates the effectiveness of SConvTransform, achieving good enough performance across different target architectures. Future work will focus on optimizing performance and porting to other target devices. When applied to standard convolution configurations, the generated code achieves up to 60% of peak performance on ARM SME and 67% on Intel AVX512. These results validate the benefit of combining static shape analysis with structured tiling and packing strategies within the MLIR Transform dialect. Furthermore, the modular design of SConvTransform facilitates integration with future extensions, enabling continued optimization of convolution workloads through MLIR's extensible compilation infrastructure.



Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation

Asib, K M Nafi, Saha, Sourav, Hoque, Mohammed Moshiul

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have advanced the automated generation of code from natural language prompts. However, low-resource languages (LRLs) like Bangla remain underrepresented due to the limited availability of instruction-to-code datasets and evaluation benchmarks. To address this, the BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on "Code Generation in Bangla". In this work, we propose a method that combines instruction prompting with a test-driven, feedback-guided iterative refinement process using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla instructions, tests it against unit tests, and iteratively refines any failing outputs through three evaluation passes, using test feedback to guide each step. This approach helped our team "Retriv" to secure 2nd place in the shared task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla instruction understanding and Python code generation, emphasizing the need for targeted methods in LRLs. We made experimental scripts publicly available for the community.


Operationalizing Automated Essay Scoring: A Human-Aware Approach

Plasencia-Calaña, Yenisel

arXiv.org Artificial Intelligence

This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.


A Multi-Agent Framework for Stateful Inference-Time Search

Lalan, Arshika, Ghosh, Rajat, Kolsur, Aditya, Dutta, Debojyoti

arXiv.org Artificial Intelligence

Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieve surface-level code generation but remain brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families - Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.


One Whisper to Grade Them All

Phan, Nhan, Porwal, Anusha, Getman, Yaroslav, Voskoboinik, Ekaterina, Grósz, Tamás, Kurimo, Mikko

arXiv.org Artificial Intelligence

We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.


Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests

Dascalescu, Stefan, Dumitran, Adrian Marius, Vasiluta, Mihai Alexandru

arXiv.org Artificial Intelligence

Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains resource-intensive and challenging for educators. This paper introduces an innovative NLP-driven method leveraging generative AI (large language models) to automate the creation of high-quality test cases for competitive programming assessments. We extensively evaluated our approach on diverse datasets, including 25 years of Romanian Informatics Olympiad (OJI) data for 5th graders, recent competitions hosted on the Kilonova.ro platform, and the International Informatics Olympiad in Teams (IIOT). Our results demonstrate that AI-generated test cases substantially enhanced assessments, notably identifying previously undetected errors in 67% of the OJI 5th grade programming problems. These improvements underscore the complementary educational value of our technique in formative assessment contexts. By openly sharing our prompts, translated datasets, and methodologies, we offer practical NLP-based tools that educators and contest organizers can readily integrate to enhance assessment quality, reduce workload, and deepen insights into learner performance.


Evaluation of Large Language Models for Anomaly Detection in Autonomous Vehicles

Loukas, Petros, Bassir, David, Chatzichristofis, Savvas, Amanatiadis, Angelos

arXiv.org Artificial Intelligence

The rapid evolution of large language models (LLMs) has pushed their boundaries to many applications in various domains. Recently, the research community has started to evaluate their potential adoption in autonomous vehicles and especially as complementary modules in the perception and planning software stacks. However, their evaluation is limited in synthetic datasets or manually driving datasets without the ground truth knowledge and more precisely, how the current perception and planning algorithms would perform in the cases under evaluation. F or this reason, this work evaluates LLMs on real-world edge cases where current autonomous vehicles have been proven to fail. The proposed architecture consists of an open vocabulary object detector coupled with prompt engineering and large language model contextual reasoning. W e evaluate several state-of-the-art models against real edge cases and provide qualitative comparison results along with a discussion on the findings for the potential application of LLMs as anomaly detectors in autonomous vehicles.


"It was Tragic": Exploring the Impact of a Robot's Shutdown

Oberlender, Agam, Erel, Hadas

arXiv.org Artificial Intelligence

It is well established that people perceive robots as social entities, even when they are not designed for social interaction. We evaluated whether the social interpretation of robotic gestures should also be considered when turning off a robot. In the experiment, participants engaged in a brief preliminary neutral interaction while a robotic arm showed interest in their actions. At the end of the task, participants were asked to turn off the robotic arm under two conditions: (1) a Non-designed condition, where all of the robot's engines were immediately and simultaneously turned off, as robots typically shut down; (2) a Designed condition, where the robot's engines gradually folded inward in a motion resembling "falling asleep." Our findings revealed that all participants anthropomorphized the robot's movement when it was turned off. In the Non-designed condition, most participants interpreted the robot's turn-off movement negatively, as if the robot had "died." In the Designed condition, most participants interpreted it more neutrally, stating that the robot "went to sleep." The robot's turn-off movement also impacted its perception, leading to higher likeability, perceived intelligence, and animacy in the Designed condition. We conclude that the impact of common edge interactions, such as turning off a robot, should be carefully designed while considering people's automatic tendency to perceive robots as social entities.